Abstract

Recommendation based on user preferences is a common
task for e-commerce websites. New recommendation
algorithms are often evaluated by offline comparison to
baseline algorithms such as recommending random or the
most popular items. Here, we investigate how these
algorithms themselves perform and compare to the
operational production system in large-scale online
experiments in a real-world application. Specifically, we
focus on recommending travel destinations at Booking.com,
a major online travel site, to users searching for their
preferred vacation activities. To build ranking models we
use multi-criteria rating data provided by previous users
after their stay at a destination. We implement three
methods and compare them to the current baseline in
Booking.com: random, most popular, and Naive Bayes. Our
general conclusion is that, in an online A/B test with live
users, our Naive Bayes based ranker increased user
engagement significantly over the current online system.

Keywords: Information Search and Retrieval, industrial 
case studies, multi-criteria ranking, travel applications, 
travel recommendations 


1 Introduction 

This paper investigates strategies to recommend
travel destinations for users who provided a list of
preferred activities at Booking.com, a major online
travel agent. This is a complex exploratory
recommendation task characterized by predicting user
preferences with a limited amount of noisy information.
In addition, the industrial application setting comes
with specific challenges for search and recommendation
systems [?].

To motivate our problem set-up, we introduce
a service which allows users to find travel destinations
based on their preferred activities, called destination
finder.¹ Consider a user who knows what activities
she wants to do during her holidays, and is looking
for travel destinations matching these activities.
This process is a complex exploratory recommendation
task in which users start by entering activities in
the search box, as shown in Figure 1. The destination
finder service returns a ranked list of recommended
destinations.

1 http://www.booking.com/destinationfinder.html

Figure 1: Example of destination finder use: a user
searching for ‘Nightlife’ and ‘Beach’ obtains a ranked
list of recommended destinations (top 4 are shown).

The underlying data is based on reviews from users
who have booked and stayed at a hotel at some
destination in the past. After their stay, users are asked to
endorse the destination with activities from a set of
‘endorsements’. Initially, the set of endorsements was
extracted from users’ free-text reviews using a
topic-modeling technique such as LDA [?]. Nowadays,
the set of endorsements consists of 256 activities such
as ‘Beach,’ ‘Nightlife,’ ‘Shopping,’ etc. These
endorsements imply that a user liked a destination for
particular characteristics. Two examples of the
collected endorsements for two destinations, ‘Bangkok’
and ‘London’, are shown in Figure 2.

As an example of the multi-criteria endorsement
data, consider three endorsements: $e_1$ = ‘Beach’, $e_2$
= ‘Shopping’, and $e_3$ = ‘Family Friendly’, and assume
that a user $u_j$, after visiting a destination $d_k$ (e.g.
‘London’), provides the review $r_i(u_j, d_k)$ as:

$$ r_i(u_j, d_k) = (0, 1, 0). \tag{1} $$

This means our user endorses London for ‘Shopping’
only. However, we cannot conclude that London
is not ‘Family Friendly’. Thus, in contrast to the
ratings data in a traditional recommender systems
setup, negative user opinions are hidden. In
addition, we are dealing with multi-criteria ranking data.

In contrast, in classical formulations of Recommender
Systems (RS), the recommendation problem
relies on single ratings (R) as a mechanism of capturing
user (U) preferences for different items (I). The
problem of estimating unknown ratings is formalized
as follows: $F : U \times I \to R$. RS based on latent
factor models have been effectively used to understand
user interests and predict future actions [?].
Such models work by projecting users and items into
a lower-dimensional space, thereby grouping similar
users and items together, and subsequently computing
similarities between them. This approach can
run into data sparsity problems, and into a continuous
cold start problem when new items continuously
appear.

In multi-criteria RS (MCRS) [?] the rating function
has the following form:

$$ F : U \times I \to (r_0 \times r_1 \times \cdots \times r_n) \tag{2} $$

The overall rating $r_0$ for an item shows how well the
user likes this item, while criteria ratings $r_1, \ldots, r_n$
provide more insight and explain which aspects of
the item she likes. MCRS predict the overall rating
for an item based on past ratings, using both overall
and individual criteria ratings, and recommend to
users the item with the best overall score. According
to [?], there are two basic approaches to compute the
final rating prediction in the case when the overall
rating is known. In our work we consider a new type
of input for RS which is multi-criteria ranking data
without an overall rating.
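
The endorsement reviews described above (Equation 1) can be pictured as binary vectors over the endorsement vocabulary. The following is only a minimal sketch: the three-item vocabulary and the helper names `encode_review` and `aggregate` are hypothetical and not taken from the production system, which uses 256 endorsements.

```python
from typing import Dict, Iterable, List, Sequence

# Hypothetical vocabulary for illustration; the real set has 256 activities.
ENDORSEMENTS: Sequence[str] = ("Beach", "Shopping", "Family Friendly")


def encode_review(endorsed: Iterable[str],
                  vocabulary: Sequence[str] = ENDORSEMENTS) -> List[int]:
    """Return a binary vector with 1 for each endorsed activity.

    A 0 only means "not endorsed in this review"; it is not a negative rating.
    """
    chosen = set(endorsed)
    unknown = chosen - set(vocabulary)
    if unknown:
        raise ValueError(f"unknown endorsements: {sorted(unknown)}")
    return [1 if e in chosen else 0 for e in vocabulary]


def aggregate(reviews: Iterable[Sequence[int]],
              vocabulary: Sequence[str] = ENDORSEMENTS) -> Dict[str, int]:
    """Sum review vectors into per-endorsement counts for one destination."""
    totals = [0] * len(vocabulary)
    for review in reviews:
        for i, flag in enumerate(review):
            totals[i] += flag
    return dict(zip(vocabulary, totals))


# The review of Equation 1: London endorsed for 'Shopping' only.
london_review = encode_review(["Shopping"])
```

Summing such vectors per destination gives endorsement counts of the kind shown on the destination pages, without any overall rating and without explicit negative feedback.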
London (United Kingdom, Greater London)
Top reasons to visit: Shopping (49973), Sightseeing (34620),
Museums (23038), Theater (18051), Culture (16212),
History (12947), Monuments (10268), Food (8891),
Entertainment (7320), City Trip (6928)

Bangkok (Thailand, Bangkok Province)
Top reasons to visit: Shopping (21396), Food (7972),
Nightlife (4872), Temples (3999), Clothes Shopping (3223),
Sightseeing (3215), Culture (2714), Street Food (1930),
Markets (1645), Gourmet Food (1353)

Figure 2: The destination finder endorsement pages
of London and Bangkok.


There are a number of important challenges in
working on the real-world application of travel
recommendations.

First, it is not easy to apply RS methods in large-scale
industrial applications. A large-scale application
of an unsupervised RS is presented in [8], where
the authors apply topic modeling techniques to discover
user preferences for items in an online store.
They apply Locality Sensitive Hashing techniques to
overcome performance issues when computing
recommendations. We should take into account the fact
that if it’s not fast, it isn’t working. Due to the volume
of traffic, offline processing (done once for all
users) comes at marginal costs, but online processing
(done separately for each user) can be excessively
expensive. Clearly, response times have to be
sub-second, but even doubling the CPU or memory
footprint comes at massive costs.

Second, there is a continuous cold start problem. A
large fraction of users has no prior interactions, making
it impossible to use collaborative recommendation,
or rely on history for recommendations. Moreover,
for travel sites, even the more active users visit
only a few times a year and have volatile needs or
different personas (e.g., business and leisure trips),
making their personal history a noisy signal at best.


To summarize, our problem setup is the following:
(1) we have a set of geographical destinations such as
‘Paris’, ‘London’, ‘Amsterdam’, etc.; and (2) each
destination was reviewed by users who visited the
destination using a set of endorsements. Our main
goal is to increase user engagement with the travel
recommendations as an indicator of their interest in the
suggested destinations.

Our main research question is: How to exploit 
multi-criteria rating data to rank travel destination 
recommendations? Our main contributions are: 

• we use multi-criteria rating data to rank a list of 
travel destinations; 

• we set up a large-scale online A/B testing evaluation
with live traffic to test our methods;

• we compared three different rankings against 
the industrial baseline and obtained a significant 
gain in user engagement in terms of conversion 
rates. 

The remainder of the paper is organized as follows.
In Section 2, we introduce our strategies to rank
destination recommendations. We present the results of
our large-scale online A/B testing in Section 3. Finally,
Section 4 concludes our work in this paper and
highlights a few future directions.


2 Ranking Destination Recommendations


In this section, we present our ranking approaches
for recommendations of travel destinations. We first
discuss our baseline, which is the current production
system of the destination finder at Booking.com.
Then, we discuss our first two approaches, which are
relatively straightforward and mainly used for
comparison: the random ranking of destinations
(Section 2.2), and the list of the most popular
destinations (Section 2.3). Finally, we will discuss a Naive
Bayes ranking approach to exploit the multi-criteria
ranking data.
2.1 Booking.com Baseline

We use the currently live ranking method at
Booking.com’s destination finder as a main baseline.
We are not able to disclose the details, but the baseline
is an optimized machine learning approach, using
the same endorsement data plus some extra features
not available to our other approaches.

We refer further to this method as ‘Baseline’.
Next, we present two widely employed baselines,
which we use to give an impression of how the baseline
performs. Then we introduce an application of
the Naive Bayes ranking approach to multi-criteria
ranking.

2.2 Random Destination Ranking

We retrieve all destinations that are endorsed at least 
for one of the activities that the user is searching 
for. The retrieved list of destinations is randomly 
permuted and is shown to users. 

We refer further to this method as ‘Random’. 
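
The ‘Random’ strategy can be sketched in a few lines. This is an illustrative sketch only: the toy counts in `COUNTS` and the function names `retrieve` and `random_ranking` are hypothetical, and the optional `rng` argument is added here purely to make the example reproducible.

```python
import random
from typing import Dict, List, Optional, Sequence

# Hypothetical toy data: destination -> endorsement -> endorsement count.
COUNTS: Dict[str, Dict[str, int]] = {
    "Miami": {"Beach": 60, "Nightlife": 30, "Food": 10},
    "Bangkok": {"Shopping": 100, "Food": 80, "Nightlife": 20},
    "Lisbon": {"Beach": 20, "Food": 20},
}


def retrieve(counts: Dict[str, Dict[str, int]],
             query: Sequence[str]) -> List[str]:
    """Destinations endorsed at least once for at least one queried activity."""
    return [
        destination
        for destination, by_endorsement in counts.items()
        if any(by_endorsement.get(activity, 0) > 0 for activity in query)
    ]


def random_ranking(counts: Dict[str, Dict[str, int]],
                   query: Sequence[str],
                   rng: Optional[random.Random] = None) -> List[str]:
    """Randomly permute the retrieved destinations."""
    ranked = retrieve(counts, query)
    (rng or random).shuffle(ranked)
    return ranked
```

For a query such as `["Beach"]`, only ‘Miami’ and ‘Lisbon’ are retrieved in this toy data, and their order is arbitrary.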

2.3 Most Popular Destinations 

A very straightforward and at the same time very
strong baseline would be the method that shows to
users the most popular destinations based on their
preferences [?]. For example, if the user searches for
the activity ‘Beach’, we calculate the popularity rank
score for a destination $d_i$ as the conditional
probability $P(\text{Beach} \mid d_i)$. If the user searches for a
second endorsement, e.g. ‘Food’, the ranking score for
$d_i$ is calculated using a Naive Bayes assumption as
$P(\text{Beach} \mid d_i) \times P(\text{Food} \mid d_i)$. In general, if the user
provides $n$ endorsements, $e_1, \ldots, e_n$, the ranking score
for $d_i$ is $P(e_1 \mid d_i) \times \cdots \times P(e_n \mid d_i)$.

We refer further to this method as ‘Popularity’. 
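
A minimal sketch of the ‘Popularity’ score follows. The text does not state how the conditional probability of an endorsement given a destination is estimated; here it is assumed to be the share of a destination's endorsements that fall on that activity. The toy counts and the function names are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical toy data: destination -> endorsement -> endorsement count.
COUNTS: Dict[str, Dict[str, int]] = {
    "Miami": {"Beach": 60, "Nightlife": 30, "Food": 10},
    "Bangkok": {"Shopping": 100, "Food": 80, "Nightlife": 20},
    "Lisbon": {"Beach": 20, "Food": 20},
}


def likelihood(counts: Dict[str, Dict[str, int]],
               destination: str, endorsement: str) -> float:
    """Assumed estimate of P(e | d): count(e, d) / total endorsements of d."""
    by_endorsement = counts[destination]
    total = sum(by_endorsement.values())
    return by_endorsement.get(endorsement, 0) / total if total else 0.0


def popularity_ranking(counts: Dict[str, Dict[str, int]],
                       query: Sequence[str]) -> List[Tuple[str, float]]:
    """Score each destination by the product of P(e | d) over the query."""
    scored = []
    for destination in counts:
        score = 1.0
        for endorsement in query:
            score *= likelihood(counts, destination, endorsement)
        scored.append((destination, score))
    # Highest score first; ties keep their original order.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

Because the score ignores how many endorsements a destination has in total, a small destination with a high share for the queried activity can outrank a much larger one.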

2.4 Naive Bayes Ranking Approach 

As a primary ranking technique we use a Naive Bayes
approach. We will describe its application to the
multi-criteria ranking data (presented in Equation 1)
with an example. Let us again consider a user searching
for ‘Beach’. We need to return a ranked list of
destinations. For instance, the ranking score for the
destination ‘Miami’ is calculated as

$$ P(\text{Miami}, \text{Beach}) = P(\text{Miami}) \times P(\text{Beach} \mid \text{Miami}), \tag{3} $$

where $P(\text{Beach} \mid \text{Miami})$ is the probability that the
destination Miami gets the endorsement ‘Beach’.
$P(\text{Miami})$ describes our prior knowledge about
Miami. In the simplest case this prior is the ratio of
the number of endorsements for Miami to the total
number of endorsements in our database.

If a user searches for a second activity, e.g.
‘Food’, the ranking score is calculated in the following
way:

$$ P(\text{Miami}, \text{Beach}, \text{Food}) = P(\text{Miami}) \times P(\text{Beach} \mid \text{Miami}) \times P(\text{Food} \mid \text{Miami}) \tag{4} $$

If our user provides $n$ endorsements, Equation 4
becomes a standard Naive Bayes formula.

We refer further to this method as ‘Naive Bayes’. 
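
The Naive Bayes score of Equations 3 and 4 can be sketched as follows. The prior follows the simplest case described above (a destination's endorsements divided by all endorsements); the estimate of the conditional probability as a within-destination share is an assumption, and the toy counts and function names are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical toy data: destination -> endorsement -> endorsement count.
COUNTS: Dict[str, Dict[str, int]] = {
    "Miami": {"Beach": 60, "Nightlife": 30, "Food": 10},
    "Bangkok": {"Shopping": 100, "Food": 80, "Nightlife": 20},
    "Lisbon": {"Beach": 20, "Food": 20},
}


def prior(counts: Dict[str, Dict[str, int]], destination: str) -> float:
    """P(d): endorsements for d divided by all endorsements in the data."""
    grand_total = sum(sum(by_e.values()) for by_e in counts.values())
    return sum(counts[destination].values()) / grand_total


def likelihood(counts: Dict[str, Dict[str, int]],
               destination: str, endorsement: str) -> float:
    """Assumed estimate of P(e | d): count(e, d) / total endorsements of d."""
    total = sum(counts[destination].values())
    return counts[destination].get(endorsement, 0) / total if total else 0.0


def naive_bayes_ranking(counts: Dict[str, Dict[str, int]],
                        query: Sequence[str]) -> List[Tuple[str, float]]:
    """Score each destination by P(d) times the product of P(e | d)."""
    scored = []
    for destination in counts:
        score = prior(counts, destination)
        for endorsement in query:
            score *= likelihood(counts, destination, endorsement)
        scored.append((destination, score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

On this toy data a query for ‘Food’ puts ‘Bangkok’ first, because its larger prior outweighs the higher ‘Food’ share of ‘Lisbon’. The prior is the only difference from the ‘Popularity’ score.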

To summarize, we described three strategies to
rank travel destination recommendations: the random
ranking, the popularity-based ranking, and the
Naive Bayes approach. These three approaches will
be compared to each other and against the industrial
baseline. Next, we will present our experimental
pipeline which involves online A/B testing at the
destination finder service of Booking.com.

3 Experiments and Results 

In this section we will describe our experimental
setup and evaluation approach, and the results of the
experiments. We perform experiments on users of
Booking.com where an instance of the destination
finder is running in order to conduct an online
evaluation. First, we will detail our online evaluation
approach and the evaluation measures used. Second, we
will detail the experimental results.

3.1 Research Methodology 

We take advantage of a production A/B testing
environment at Booking.com, which performs randomized
controlled trials for the purpose of inferring
causality. A/B testing randomly splits users to see
either the baseline or the new variant version of the
website, which allows us to measure the impact of the
new version directly on real users [?].

As our primary evaluation metric in the A/B test,
we use conversion rate, which is the fraction of sessions
which end with at least one clicked result [?].
As explained in the motivation, we are dealing with
an exploratory task and therefore aim to increase
customer engagement. An increase in conversion rate is
a signal that users click on the suggested destinations
and thus interact with the system.
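
As a small illustration of the metric, conversion rate can be computed from per-session click counts. The input format (a list of click counts, one per session) and the function name are assumptions made for this sketch.

```python
from typing import Iterable


def conversion_rate(clicks_per_session: Iterable[int]) -> float:
    """Fraction of sessions that contain at least one clicked result."""
    sessions = list(clicks_per_session)
    if not sessions:
        raise ValueError("at least one session is required")
    converted = sum(1 for clicks in sessions if clicks > 0)
    return converted / len(sessions)


# Hypothetical example: four sessions, two of which contain a click.
example_rate = conversion_rate([0, 2, 1, 0])
```

A session with several clicks counts once, so the metric measures whether users engage at all rather than how much.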

In order to determine whether a change in conversion
rate is a random statistical fluctuation or a
statistically significant change, we use the G-test
statistic (G-tests of goodness-of-fit). We consider the
difference between the baseline and the newly proposed
method significant when the G-test confidence level,
i.e. one minus the p-value, is larger than 90%.
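
To make the significance check concrete, here is a sketch of a G-test on a 2x2 table of clickers and non-clickers for two variants. Note the assumptions: this is the independence form of the G-test, which may differ in detail from the goodness-of-fit variant named above, and the p-value uses the chi-square approximation with one degree of freedom. The function name and the counts in the usage lines are hypothetical.

```python
import math
from typing import Tuple


def g_test_2x2(clicked_a: int, total_a: int,
               clicked_b: int, total_b: int) -> Tuple[float, float]:
    """Return (G statistic, p-value) for a 2x2 clicked / not-clicked table."""
    if total_a <= 0 or total_b <= 0:
        raise ValueError("both variants need at least one user")
    observed = [
        [clicked_a, total_a - clicked_a],
        [clicked_b, total_b - clicked_b],
    ]
    grand_total = total_a + total_b
    row_sums = [total_a, total_b]
    col_sums = [clicked_a + clicked_b, grand_total - clicked_a - clicked_b]

    g = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:  # a zero cell contributes nothing to the sum
                expected = row_sums[i] * col_sums[j] / grand_total
                g += 2.0 * o * math.log(o / expected)
    g = max(g, 0.0)  # guard against tiny negative rounding errors

    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(g / 2.0))
    return g, p_value


# Hypothetical counts: 250 of 1000 users click in A, 290 of 1000 in B.
g_stat, p = g_test_2x2(250, 1000, 290, 1000)
confidence = 1.0 - p
```

Under the reading that the reported percentage is one minus the p-value, a difference would be called significant when `confidence` exceeds 0.90.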

3.2 Results 

Conversion rate is the probability for a user to click
at least once, which is a common metric for user
engagement. We used it as a primary evaluation metric
in our experimentation. Table 1 shows the results
of our A/B test. The production ‘Baseline’ substantially
outperforms the ‘Random’ ranking with respect
to conversion rate, and performs slightly (but not
significantly) better than the ‘Popularity’ approach.
The ‘Naive Bayes’ ranker significantly increases the
conversion rate, from 25.61% to 26.73%, a relative
increase of 4.4% compared to the production
baseline.
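
For reference, the 4.4% figure is a relative increase, which can be checked from the conversion rates in Table 1:

```latex
\[
\frac{26.73 - 25.61}{25.61} = \frac{1.12}{25.61} \approx 0.0437 \approx 4.4\%
\]
```

In absolute terms the difference is 1.12 percentage points.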

We achieved this substantial increase in conversion
rate with a straightforward Naive Bayes ranker.
Moreover, most computations can be done offline.
Thus, our model could be trained on large data
within reasonable time, and did not negatively impact
wallclock and CPU time for the destination
finder web pages in the online A/B test. This is crucial
for a webscale production environment [?].
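
One way to picture the split between offline and online work is to precompute log-probabilities once and only sum them at query time. This is a sketch under stated assumptions, not the production design: the index layout, the function names `build_index` and `rank_online`, the toy counts, and the default `top_k` of 4 are all hypothetical.

```python
import math
from typing import Dict, List, Mapping, Sequence, Tuple

Counts = Mapping[str, Mapping[str, int]]
Index = Dict[str, Tuple[float, Dict[str, float]]]

# Hypothetical toy data: destination -> endorsement -> endorsement count.
COUNTS: Dict[str, Dict[str, int]] = {
    "Miami": {"Beach": 60, "Nightlife": 30, "Food": 10},
    "Bangkok": {"Shopping": 100, "Food": 80, "Nightlife": 20},
    "Lisbon": {"Beach": 20, "Food": 20},
}


def build_index(counts: Counts) -> Index:
    """Offline step: store log P(d) and log P(e | d) for every destination."""
    grand_total = sum(sum(by_e.values()) for by_e in counts.values())
    index: Index = {}
    for destination, by_e in counts.items():
        total = sum(by_e.values())
        if total == 0:
            continue  # no endorsements, nothing to score
        log_prior = math.log(total / grand_total)
        log_likelihoods = {
            e: math.log(c / total) for e, c in by_e.items() if c > 0
        }
        index[destination] = (log_prior, log_likelihoods)
    return index


def rank_online(index: Index, query: Sequence[str],
                top_k: int = 4) -> List[str]:
    """Online step: sum precomputed logs and return the top_k destinations.

    Destinations lacking any queried endorsement have score zero and are skipped.
    """
    scored = []
    for destination, (log_prior, log_likelihoods) in index.items():
        if all(e in log_likelihoods for e in query):
            score = log_prior + sum(log_likelihoods[e] for e in query)
            scored.append((score, destination))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [destination for _, destination in scored[:top_k]]


INDEX = build_index(COUNTS)  # built once, reused for every query
```

Working in log space turns the product of Equation 4 into a sum of lookups, so the per-query cost is a few additions per candidate destination.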

To summarize, we used three approaches to rank
travel recommendations. We saw that the random
and popularity-based ranking of destinations lead to
a decrease in user engagement, while the Naive Bayes
approach leads to a significant engagement increase.


4 Conclusion and Discussion 

This paper reports on large-scale experiments with
four different approaches to rank travel destination
recommendations at Booking.com, a major online
travel agent. We focused on a service called destination
finder where users can search for suitable destinations
based on preferred activities. In order to build
ranking models we used multi-criteria rating data in
the form of endorsements provided by past users after
visiting a booked place.

We implemented three methods to rank travel
destinations: Random, Most Popular, and Naive Bayes,
and compared them to the current production baseline
in Booking.com. We observed a significant increase
in user engagement for the Naive Bayes ranking
approach, as measured by the conversion rate.
The simplicity of our recommendation models enables
us to achieve this engagement without significantly
increasing online CPU and memory usage.
The experiments clearly demonstrate the value of
multi-criteria ranking data in a real-world application.
They also show that simple algorithmic approaches
trained on large data sets can have very
good real-life performance [?].

We are working on a number of extensions of the
current work, in particular on contextual recommendation
approaches that take into account the context
of the user and the endorser, and on ways to
detect user profiles from implicit contextual information.
Initial experiments with contextualized recommendations
show that this can lead to significant further
improvements of user engagement.

Some of the authors are involved in the organization
of the TREC Contextual Suggestion Track
[?] [7] [15], and the use case of the destination finder
is part of TREC in 2015, where similar endorsements
are collected. The resulting test collection can be
used to evaluate destination and venue recommendation
approaches.
Table 1: Results of the destination finder online A/B testing based on the number of unique users and
clickers.


Ranker type    Number of users    Conversion rate    G-test
Baseline       9,928              25.61% ± 0.72%
Random         10,079             24.46% ± 0.71%     94%
Popularity     9,838              25.50% ± 0.73%     41%
Naive Bayes    9,895              26.73% ± 0.73%     93%